The Use of Non-ASCII Characters in RFCs
Abstract 


In order to support the internationalization of protocols and a more 
diverse Internet community, the RFC Series must evolve to allow for 
the use of non-ASCII characters in RFCs. While English remains the 
required language of the Series, the encoding of future RFCs will be 
in UTF-8, allowing for a broader range of characters than typically 
used in the English language. This document describes the RFC Editor 
requirements and gives guidance regarding the use of non-ASCII 
characters in RFCs. 


This document updates RFC 7322. Please view this document in PDF 
form to see the full text. 


Status of This Memo 


This document is not an Internet Standards Track specification; it is 
published for informational purposes. 


This document is a product of the Internet Architecture Board (IAB) 
and represents information that the IAB has deemed valuable to 
provide for permanent record. It represents the consensus of the 
Internet Architecture Board (IAB). Documents approved for 
publication by the IAB are not a candidate for any level of Internet 
Standard; see Section 2 of RFC 7841. 


Information about the current status of this document, any errata,
and how to provide feedback on it may be obtained at
http://www.rfc-editor.org/info/rfc7997.
1. Introduction


Please review the PDF version of this document.


For much of the history of the RFC Series, the character encoding 
used for RFCs has been ASCII [RFC20]. This was a sensible choice at 
the time: the language of the Series has always been English, a 
language that primarily uses ASCII-encoded characters (ignoring for a 
moment words borrowed from more richly decorated alphabets); and, 
ASCII is the "lowest common denominator" for character encoding, 
making cross-platform viewing trivial. 


There are limits to ASCII, however, that hinder its continued use as 
the exclusive character encoding for the Series. The increasing need 
for easily readable, internationalized content suggests it is time to 
allow non-ASCII characters in RFCs where necessary. To support this 
move away from ASCII, RFCs will switch to supporting UTF-8 as the 
default character encoding and will allow support for a broad range 
of Unicode characters [UnicodeCurrent]. Note that the RFC Editor may 
reject any code point that does not render adequately across all 
formats or in enough rendering engines using the v3 tooling. 


Given the continuing goal of maximum readability across platforms, 
the use of non-ASCII characters should be limited to only where 
necessary within the text. This document describes the rules under 
which non-ASCII characters may be used in an RFC. These rules will 
be applied as the necessary changes are made to submission checking 
and editorial tools. 


This document updates the RFC Style Guide [RFC7322]. 


The details included in this document are expected to change based on 
experience gained in implementing the new publication toolsets. 
Revised documents will be published capturing those changes as the 
toolsets are completed. Other implementers must not expect those 
changes to remain backwards compatible with the details included in 
this document. 


2. Basic Requirements


Two fundamental requirements inform the guidance and examples 
provided in this document. They are: 


o Searches against RFC indexes and database tables need to return 
expected results and support appropriate Unicode string matching 
behaviors;


o RFCs must be able to be displayed correctly across a wide range of
readers and browsers. People whose systems do not have the fonts 
needed to display a particular RFC need to be able to read the 
various publication formats and the XML correctly in order to 
understand and implement the information described in the 
document. 
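

As an illustration of the first requirement, the following Python
sketch shows one way a search tool might compare strings so that
differently encoded forms of the same name still match. The function
names are hypothetical, and the choice of NFKC normalization followed
by case folding is an assumption made for this example; it is not a
behavior specified by the RFC Editor.

```python
import unicodedata


def search_key(text: str) -> str:
    """Fold a string for comparison: NFKC normalization, then case folding."""
    return unicodedata.normalize("NFKC", text).casefold()


def matches(query: str, indexed: str) -> bool:
    """Return True when the folded query occurs in the folded index entry."""
    return search_key(query) in search_key(indexed)


# The same surname with precomposed letters and with combining marks.
precomposed = "F\u00e4ltstr\u00f6m"
decomposed = "Fa\u0308ltstro\u0308m"

print(precomposed == decomposed)  # False: the code point sequences differ
print(matches(decomposed, "P. " + precomposed))  # True after folding
```

A production index would likely need locale-aware collation as well;
this sketch only covers normalization and case.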


3. Rules for the Use of Non-ASCII Characters


This section describes the guidelines for the use of non-ASCII 
characters in an RFC. If the RFC Editor identifies areas where the 
use of non-ASCII characters negatively impacts the readability of the 
text, they will request alternate text. 


The RFC Editor may, in cases of entire words represented in non-ASCII 
characters, ask for a set of reviewers to verify the meaning, 
spelling, characters, and grammar of the text. 


3.1. General Usage throughout a Document


Where the use of non-ASCII characters is purely part of an example 
and not otherwise required for correct protocol operation, escaping 
the non-ASCII character is not required. Note, however, that as the 
language of the RFC Series is English, the use of non-ASCII 
characters is based on the spelling of words commonly used in the 
English language following the guidance in the Merriam-Webster 
dictionary [MerrWeb]. 


The RFC Editor will use the primary spelling listed in that 
dictionary by default. 


Example of non-ASCII characters that do not require escaping (example
from Section 3.1.1.12 of RFC 4475 [RFC4475], with a hex dump replaced
by the actual character glyphs):


This particular response contains unreserved and non-ASCII
UTF-8 characters.
This response is well formed. A parser must accept this message.


Message Details : unreason


SIP/2.0 200 = 2**3 * 5**2 но сто девяносто девять - простое
Via: SIP/2.0/UDP 192.0.2.198;branch=z9hG4bK1324923
Call-ID: unreason.1234ksdfak3j2erwedfsASdf
CSeq: 35 INVITE
From: sip:user@example.com;tag=11141343
To: sip:user@example.edu;tag=2229
Content-Length: 154
Content-Type: application/sdp
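

Replacing a hex dump with glyphs can be done mechanically. The Python
sketch below decodes UTF-8 octets written in hexadecimal into the
characters they encode. The helper name and the two-letter Cyrillic
sample are chosen for illustration only.

```python
def glyphs_from_hex(hex_dump: str) -> str:
    """Decode UTF-8 octets given as hexadecimal text into characters."""
    return bytes.fromhex(hex_dump).decode("utf-8")


# D0 BD and D0 BE are the UTF-8 encodings of U+043D and U+043E.
sample = glyphs_from_hex("D0BD D0BE")
print(sample)
print([f"U+{ord(ch):04X}" for ch in sample])  # ['U+043D', 'U+043E']
```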
3.2. Person Names


Person names may appear in several places within an RFC (e.g., the 
header, Acknowledgements, and References). When a script outside the 
Unicode Latin blocks [UNICODE-CHART] is used for an individual name, 
an author-provided, ASCII-only identifier will appear immediately 
after the non-Latin characters, surrounded by parentheses. This will 
improve general readability of the text. 


Example header: 


OLD:

Internet Engineering Task Force (IETF)          J. Tong
Request for Comments: 7380                      C. Bi, Ed.
Category: Standards Track                       China Telecom
ISSN: 2070-1721                                 R. Even
                                                Q. Wu, Ed.
                                                R. Huang
                                                Huawei
                                                November 2014


PROPOSED/NEW:

Internet Engineering Task Force (IETF)          J. Tong
Request for Comments: 7380                      C. Bi, Ed.
Category: Standards Track                       China Telecom
ISSN: 2070-1721                                 רוני אבן (R. Even)
                                                吴钦 (Q. Wu), Ed.
                                                R. Huang
                                                Huawei
                                                November 2014


Example Acknowledgements section:


OLD:

The following people contributed significant text to early versions
of this draft: Patrik Faltstrom, William Chan, and Fred Baker.


PROPOSED/NEW:

The following people contributed significant text to early versions
of this draft: Patrik Fältström (Faltstrom), 陈智昌 (William Chan),
and Fred Baker.


Example reference entry:


OLD:

[RFC6630] Cao, Z., Deng, H., Wu, Q., and G. Zorn, Ed., "EAP
          Re-authentication Protocol Extensions for Authenticated
          Anticipatory Keying (ERP/AAK)", RFC 6630, June 2012.


NEW:

[RFC6630] Cao, Z., Deng, H., 吴钦 (Wu, Q.), and G. Zorn, Ed., "EAP
          Re-authentication Protocol Extensions for Authenticated
          Anticipatory Keying (ERP/AAK)", RFC 6630, June 2012.
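

The parenthetical rule for person names can be sketched in Python as
follows. This is a rough illustration, not the RFC Editor's tooling.
The function names are hypothetical, the sample name is invented, and
the test for "Latin" (a Unicode character name beginning with "LATIN")
is a simplifying assumption.

```python
import unicodedata


def uses_non_latin_letters(name: str) -> bool:
    """Return True if any letter in the name is outside the Latin script.

    Heuristic: a letter counts as Latin when its Unicode character name
    starts with "LATIN".
    """
    return any(
        ch.isalpha() and not unicodedata.name(ch, "").startswith("LATIN")
        for ch in name
    )


def display_name(name: str, ascii_identifier: str) -> str:
    """Append the ASCII-only identifier in parentheses when it is needed."""
    if uses_non_latin_letters(name):
        return f"{name} ({ascii_identifier})"
    return name


print(display_name("R. Huang", "R. Huang"))
print(display_name("Иван Петров", "I. Petrov"))  # hypothetical author
```

In this sketch, a name in Latin script with diacritics would be left
without a parenthetical; whether one is still wanted for searching is
an editorial decision outside the scope of the example.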


3.3. Company Names 


Company names may appear in several places within an RFC. In all 
cases, valid Unicode is required. For names that include characters 
outside of the Unicode Latin and Latin Extended scripts, an author- 
provided, ASCII-only identifier is required to assist in searching 
and indexing of the document. 


3.4. Body of the Document 


When the mention of non-ASCII characters is required for correct 
protocol operation and understanding, the characters' Unicode code 
points must be used in the text. The addition of each character name 
is encouraged. 


o Non-ASCII characters will require identifying the Unicode code 
point. 


o Use of the actual UTF-8 character (e.g., ∆) is encouraged so
that a reader can more easily see what the character is, if their
device can render the text.


o The use of the Unicode character names like "INCREMENT" in 
addition to the use of Unicode code points is also encouraged. 
When used, Unicode character names should be in all capital 
letters. 
Examples:


OLD [RFC7564]:

However, the problem is made more serious by introducing the full
range of Unicode code points into protocol strings. For example,
the characters U+13DA U+13A2 U+13B5 U+13AC U+13A2 U+13AC U+13D2 from
the Cherokee block look similar to the ASCII characters
"STPETER" as they might appear when presented using a "creative"
font family.


NEW/ALLOWED:

However, the problem is made more serious by introducing the full
range of Unicode code points into protocol strings. For example,
the characters U+13DA U+13A2 U+13B5 U+13AC U+13A2 U+13AC U+13D2
(ᏚᎢᎵᎬᎢᎬᏒ) from the Cherokee block look similar to the ASCII
characters "STPETER" as they might appear when presented using a
"creative" font family.


ALSO ACCEPTABLE:

However, the problem is made more serious by introducing the full
range of Unicode code points into protocol strings. For example,
the characters "ᏚᎢᎵᎬᎢᎬᏒ" (U+13DA U+13A2 U+13B5 U+13AC U+13A2
U+13AC U+13D2) from the Cherokee block look similar to the ASCII
characters "STPETER" as they might appear when presented using a
"creative" font family.

Example of proper identification of Unicode characters in an RFC:


Acceptable:


o Temperature changes in the Temperature Control Protocol are
indicated by the U+2206 character.


Preferred:


1. Temperature changes in the Temperature Control Protocol are
indicated by the U+2206 character ("∆").


2. Temperature changes in the Temperature Control Protocol are
indicated by the U+2206 character (INCREMENT).


3. Temperature changes in the Temperature Control Protocol are
indicated by the U+2206 character ("∆", INCREMENT).


4. Temperature changes in the Temperature Control Protocol are
indicated by the U+2206 character (INCREMENT, "∆").


5. Temperature changes in the Temperature Control Protocol are
indicated by the "Delta" character "∆" (U+2206).


6. Temperature changes in the Temperature Control Protocol are
indicated by the character "∆" (INCREMENT, U+2206).


Which option of (1), (2), (3), (4), (5), or (6) is preferred may 
depend on context and the specific character(s) in question. All are 
acceptable within an RFC. "US-ASCII Escaping of Unicode Character" 
[BCP137] describes the pros and cons of different options for 
identifying Unicode characters and may help authors decide how to 
represent the non-ASCII characters in their documents. 
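

An author who wants to generate these identifications consistently
could derive the code point and character name from the character
itself. The Python sketch below produces the wording of option (3).
The function names are hypothetical; the character names come from
the Unicode data shipped with the Python interpreter, which may lag
behind the current Unicode version.

```python
import unicodedata


def code_point(ch: str) -> str:
    """Format one character in U+ notation with at least four hex digits."""
    return f"U+{ord(ch):04X}"


def identify(ch: str) -> str:
    """Combine code point, glyph, and character name, as in option (3)."""
    return f'the {code_point(ch)} character ("{ch}", {unicodedata.name(ch)})'


print(identify("\u2206"))  # the U+2206 character ("∆", INCREMENT)
```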


3.5. Tables 


Tables follow the same rules for identifiers and characters as in
"Body of the Document" (Section 3.4). If it is sensible (i.e., more
understandable for a reader) for a given document to have two tables
-- one including the identifiers and non-ASCII characters and a
second with just the non-ASCII characters -- then that will be
allowed at the discretion of the authors.


Original text from "Preparation, Enforcement, and Comparison of
Internationalized Strings Representing Usernames and Passwords"
[RFC7613].


Table 3: A sample of legal passwords

+------------------------------------+------------------------------+
| # | Password                       | Notes                        |
+------------------------------------+------------------------------+
| 12| <correct horse battery staple> | ASCII space is allowed       |
+------------------------------------+------------------------------+
| 13| <Correct Horse Battery Staple> | Different from example 12    |
+------------------------------------+------------------------------+
| 14| <&#x3C0;&#xDF;&#xE57;>         | Non-ASCII letters are OK     |
|   |                                | (e.g., GREEK SMALL LETTER    |
|   |                                | PI, U+03C0)                  |
+------------------------------------+------------------------------+
| 15| <Jack of &#x2666;s>            | Symbols are OK (e.g., BLACK  |
|   |                                | DIAMOND SUIT, U+2666)        |
+------------------------------------+------------------------------+
| 16| <foo&#x1680;bar>               | OGHAM SPACE MARK, U+1680, is |
|   |                                | mapped to U+0020 and thus    |
|   |                                | the full string is mapped to |
|   |                                | <foo bar>                    |
+------------------------------------+------------------------------+


Preferred text:


Table 3: A sample of legal passwords

+------------------------------------+------------------------------+
| # | Password                       | Notes                        |
+------------------------------------+------------------------------+
| 12| <correct horse battery staple> | ASCII space is allowed       |
+------------------------------------+------------------------------+
| 13| <Correct Horse Battery Staple> | Different from example 12    |
+------------------------------------+------------------------------+
| 14| <πß๗>                          | Non-ASCII letters are OK     |
|   |                                | (e.g., GREEK SMALL LETTER    |
|   |                                | PI, U+03C0; LATIN SMALL      |
|   |                                | LETTER SHARP S, U+00DF; THAI |
|   |                                | DIGIT SEVEN, U+0E57)         |
+------------------------------------+------------------------------+
| 15| <Jack of ♦s>                   | Symbols are OK (e.g., BLACK  |
|   |                                | DIAMOND SUIT, U+2666)        |
+------------------------------------+------------------------------+
| 16| <foo bar>                      | OGHAM SPACE MARK, U+1680, is |
|   |                                | mapped to U+0020 and thus    |
|   |                                | the full string is mapped to |
|   |                                | <foo bar>                    |
+------------------------------------+------------------------------+
3.6. Code Components


The RFC Editor encourages the use of the U+ notation except within a 
code component where one must follow the rules of the programming 
language in which the code is being written. 


Code components are generally expected to use fixed-width fonts. 
Where such fonts are not available for a particular script, the best 
script-appropriate font will be used for that part of the code 
component. 
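

To illustrate the difference between U+ notation in prose and the
escapes a programming language requires, the following sketch uses
Python as the example language. The helper from_u_notation is
hypothetical; the "\u" and "\N{...}" escapes are standard Python
string syntax.

```python
def from_u_notation(text: str) -> str:
    """Convert a string such as "U+2206" into the character it identifies."""
    if not text.upper().startswith("U+"):
        raise ValueError(f"not in U+ notation: {text!r}")
    return chr(int(text[2:], 16))


# Three spellings of the same character inside Python source code.
by_escape = "\u2206"
by_name = "\N{INCREMENT}"
by_notation = from_u_notation("U+2206")

print(by_escape == by_name == by_notation)  # True
```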


3.7. Bibliographic Text 


The reference entry must be in English; whatever subfields are 
present must be available in ASCII-encoded characters. For 
references to RFCs and Internet-Drafts, the author's name will be 
formatted in the reference as per current RFC Style Guide 
recommendations. As long as good sense is used, the reference entry 
may also include non-ASCII characters at the author's discretion and 
as provided by the author. The RFC Editor may request that a third
party, such as a language specialist or subject matter expert, review
any non-ASCII reference. This applies to both normative and
informative references.


Example: 


[GOST3410] "Information technology. Cryptographic data security.
           Signature and verification processes of [electronic]
           digital signature.", GOST R 34.10-2001, Gosudarstvennyi
           Standard of Russian Federation, Government Committee of
           Russia for Standards, 2001. (In Russian)


Allowable addition to the above citation:

           "Информационная технология. Криптографическая защита
           информации. Процессы формирования и проверки
           электронной цифровой подписи", GOST R 34.10-2001,
           Государственный стандарт Российской Федерации, 2001.


Alternatively:

[GOST3410] "Information technology. Cryptographic data security.
           Signature and verification processes of [electronic]
           digital signature.", GOST R 34.10-2001, Gosudarstvennyi
           Standard of Russian Federation, Правительственная
           комиссия России по стандартам (Government Committee of
           Russia for Standards), 2001. (In Russian)
3.8. Keywords and Citation Tags


Keywords (as tagged with the <keyword> element in XML) and citation 
tags (as defined in the anchor attributes of <reference> elements) 
must contain only ASCII characters. 
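

A submission check for this rule might look like the Python sketch
below. It is only an illustration: the function name and the sample
XML fragment are invented, and the real checking tools may work
differently.

```python
import xml.etree.ElementTree as ET

# Invented fragment containing only the elements this check looks at.
SAMPLE = """<rfc>
  <front>
    <keyword>internationalization</keyword>
    <keyword>Unicode</keyword>
  </front>
  <back>
    <references>
      <reference anchor="RFC7322"/>
      <reference anchor="GOST3410"/>
    </references>
  </back>
</rfc>"""


def ascii_violations(xml_text: str) -> list:
    """List keyword texts and reference anchors that contain non-ASCII."""
    root = ET.fromstring(xml_text)
    values = [kw.text or "" for kw in root.iter("keyword")]
    values += [ref.get("anchor", "") for ref in root.iter("reference")]
    return [value for value in values if not value.isascii()]


print(ascii_violations(SAMPLE))  # []
```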


3.9. Address Information 


The purpose of providing address information, either postal or email, 
is to assist readers of an RFC in contacting the author or authors. 
Authors may include the official postal address as recognized by 
their company or local postal service without additional non-ASCII 
character escapes. If the email address includes non-ASCII 
characters and is a valid email address at the time of publication, 
non-ASCII character escapes are not required. 


Example:


Qin Wu (editor)
Huawei
101 Software Avenue, Yuhua District
Nanjing, Jiangsu 210012
China


Additional contact information:


吴钦 (editor)
[Postal address in Chinese characters; not legible in this text
rendering. See the PDF version.]


Roni Even
Huawei
14 David Hamelech
Tel Aviv 64953
Israel


Additional contact information:


רוני אבן
[Postal address in Hebrew characters; not legible in this text
rendering. See the PDF version.]

4. Normalization Forms 


Authors should not expect normalization forms [UNICODE-NORM] to be 
preserved. If a particular normalization form is expected, note that 
in the text of the RFC. 
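

The effect of a change in normalization form can be seen in a short
Python sketch. The helper name is hypothetical; unicodedata.normalize
is part of the Python standard library.

```python
import unicodedata


def is_in_form(text: str, form: str) -> bool:
    """Return True when the text is already in the given normalization form."""
    return unicodedata.normalize(form, text) == text


composed = unicodedata.normalize("NFC", "Fa\u0308ltstro\u0308m")
decomposed = unicodedata.normalize("NFD", composed)

print(len(composed), len(decomposed))  # 9 11
print(is_in_form(composed, "NFC"), is_in_form(composed, "NFD"))  # True False
```

The two strings render identically but differ in length and in their
code point sequences, which is why a protocol that depends on one form
needs to say so explicitly.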


5. XML Markup 


As described above, use of non-ASCII characters in areas such as 
email, company name, address, and name is allowed. In order to make 
it easier for code to identify the appropriate ASCII alternatives, 
authors must include an "ascii" attribute to their XML markup when an 
ASCII alternative is required. See [RFC7991] for more detail on how 
to tag ASCII alternatives. 
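

As a rough sketch of how code might use the attribute, the Python
example below reports elements whose text is non-ASCII but that carry
no "ascii" attribute. The set of elements checked is an illustrative
subset chosen for this example, the sample organization and city are
invented, and [RFC7991] remains the authority on which elements accept
the attribute.

```python
import xml.etree.ElementTree as ET

# Illustrative subset only; see [RFC7991] for the actual vocabulary.
CHECKED_TAGS = ("organization", "city", "country")

# Invented fragment: the organization has an ASCII alternative, the city does not.
SAMPLE = """<author>
  <organization ascii="Example Corp">Пример</organization>
  <address><postal><city>Москва</city></postal></address>
</author>"""


def missing_ascii(xml_text: str) -> list:
    """Return (tag, text) pairs that need an ASCII alternative but lack one."""
    root = ET.fromstring(xml_text)
    problems = []
    for elem in root.iter():
        text = (elem.text or "").strip()
        if elem.tag in CHECKED_TAGS and not text.isascii() and not elem.get("ascii"):
            problems.append((elem.tag, text))
    return problems


print(missing_ascii(SAMPLE))  # [('city', 'Москва')]
```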


6. Internationalization Considerations


The ability to use non-ASCII characters in RFCs in a clear and
consistent manner will improve the ability to describe 
internationalized protocols and will recognize the diversity of 
authors. However, the goal of readability will override the use of 
non-ASCII characters within the text. 

7. Security Considerations 


Valid Unicode that matches the expected text must be verified in 
order to preserve expected behavior and protocol information. 